Biological Pattern Discovery with R Machine Learning Approaches (Zheng Rong Yang)

Prepare a data set $\mathbf{x} = (x_1, x_2, \cdots, x_N)$.

Design bins $B = (B_1, B_2, \cdots, B_M)$, such as the eight bins (M = 8) shown in the top-right panel of Figure 2.2.

Throw each data point $x_n$ into a bin $B_m$ if $x_n$ is within the boundaries of $B_m$.

Count the data points ($x_n$) falling in each bin $B_m$, such as 1,277 insertions in the first bin in Figure 2.2.

Denote the data point count of each bin as $f_m$ (frequency), as shown in the bottom-left panel of Figure 2.2.

Convert each frequency ($f_m$) to a probability ($p_m$), as shown in the bottom-right panel of Figure 2.2.

Use either $f_m$ or $p_m$ to interpret the estimated density.
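The counting steps above can be sketched in a few lines of base R. This is only a minimal illustration under stated assumptions: the data values and the three bin boundaries are made up for the example (they are not the transposon data of Figure 2.2), and the variable names are hypothetical.

```r
# Prepare a small synthetic data set (made-up values, not the Figure 2.2 data)
x <- c(0.5, 1.2, 1.7, 2.1, 2.4, 2.9, 3.3, 3.8, 4.6, 5.5)

# Design M = 3 bins through their boundaries: [0, 2), [2, 4), [4, 6)
boundaries <- c(0, 2, 4, 6)
M <- length(boundaries) - 1

# Throw each data point into the bin whose boundaries contain it;
# findInterval() returns, for each x, the index m of that bin
bin_index <- findInterval(x, boundaries)

# Count the data points falling in each bin to obtain the frequencies
f <- tabulate(bin_index, nbins = M)

print(bin_index)
print(f)
```

Here the bins are half-open intervals, closed on the left, which is the default behaviour of findInterval; a different boundary convention would move points that sit exactly on a boundary.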

The formula for converting a frequency histogram to a probabilistic histogram is defined as below,

$$p_m = \frac{f_m}{\sum_{k=1}^{M} f_k} \qquad (2.1)$$
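As a small worked instance of Equation (2.1), the conversion can be written as a one-line normalisation in R. The helper name freq_to_prob and the bin counts are assumptions introduced for illustration only.

```r
# Hypothetical helper for Equation (2.1): divide each bin count by the total count
freq_to_prob <- function(f) {
  stopifnot(is.numeric(f), all(f >= 0), sum(f) > 0)
  f / sum(f)
}

f <- c(3, 5, 2)        # made-up frequencies for M = 3 bins
p <- freq_to_prob(f)   # 3/10, 5/10, 2/10

print(p)
print(sum(p))          # the probabilities over all bins sum to one
```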

The two histogram statistics have different meanings. $f_m$ stands for the data point count of the m^th bin, while $p_m$ stands for the probability for data to fall into the m^th bin. The former can be used for interpreting or explaining how data are distributed. The latter can be used for making an inference about novel data. In other words, the probability of a bin indicates how likely it is that a novel data point occurs in the bin. For instance, for the gene isftu1 shown in the bottom-right panel of Figure 2.2, the probabilities that a novel transposon insertion statistic falls into the first bin and the [...] bin were 8% and 19%, respectively.
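To illustrate how bin probabilities support inference on a novel data point, the sketch below looks up the bin that a new value falls into and reads off that bin's probability. The boundaries, the probabilities and the value x_new are made up for the example and do not come from Figure 2.2.

```r
boundaries <- c(0, 2, 4, 6)    # made-up boundaries of M = 3 bins
p <- c(0.3, 0.5, 0.2)          # made-up bin probabilities, summing to one

x_new <- 4.6                           # a hypothetical novel data point
m <- findInterval(x_new, boundaries)   # index of the bin containing x_new
p_new <- p[m]                          # estimated probability of landing in that bin

print(m)
print(p_new)
```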

The R function for the histogram algorithm is hist. Given a vector x, the syntax of the hist function is shown below, where nclass is the parameter for designing the bin number,

hist(x, nclass, ...)
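A short usage sketch of hist follows, assuming synthetic normally distributed data rather than the data behind Figure 2.2. It passes plot = FALSE so that the call returns the histogram object without drawing, then converts the returned counts to probabilities. Note that base R treats a single number of bins as a suggestion and picks "pretty" break points, so the number of bins returned may differ from the number requested.

```r
set.seed(1)
x <- rnorm(200)                          # synthetic data, for illustration only

h <- hist(x, nclass = 8, plot = FALSE)   # request roughly eight bins

f <- h$counts                            # frequency of each bin
p <- f / sum(f)                          # probability of each bin

print(h$breaks)                          # bin boundaries chosen by hist
print(f)
print(round(p, 3))
```

Calling hist(x, nclass = 8) without plot = FALSE draws the frequency histogram directly.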